HubIA's DGX overview

General information

  • OS distribution: DGX OS 5.5.0 (Ubuntu 20.04.6 LTS / GNU/Linux 5.4.0-153-generic x86_64)
  • GPU: four NVIDIA A100, each with 80 GB of GPU memory
  • CPU: single AMD 7742, 64 cores at 2.25 GHz (base) / 3.4 GHz (max boost); each physical core is split into two logical cores, giving a total of 128 logical cores
  • System memory: DDR4 RAM, eight modules of 64 GB each (512 GB total)
  • Data storage: Cache/Data U.2 NVMe drive (7.68 TB)
  • OS storage: Boot M.2 NVMe drive (1.92 TB)

Multi-Instance GPU (MIG)

Each NVIDIA A100 can be virtually divided into several MIG instances (MIGs) of 10 GB, 20 GB or 40 GB of GPU memory, which multiplies the possibilities for parallel access. From the user's point of view, each MIG behaves like a separate GPU. In practice, the GPU power available is as follows: ten 1g.10gb MIGs, four 2g.20gb MIGs, one 3g.40gb MIG and one full A100 (80 GB), subsequently also called a MIG for simplicity.
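
Once connected to the DGX, you can check which GPUs and MIG instances the driver exposes. This is a minimal sketch using standard nvidia-smi commands (it assumes nvidia-smi is on your PATH, which is the case on DGX OS):

    # List every physical GPU and every MIG device visible on the station
    nvidia-smi -L

    # Show the GPU instances (MIG layout) configured on GPU 0
    nvidia-smi mig -lgi -i 0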

How can I use those MIGs?

Having a machine dedicated to your computation is called allocating that machine. So, to run a program on a machine of a cluster, you first need to allocate that machine.

In the case of HubIA's DGX Station A100, there is only one machine: what you allocate directly is MIGs, according to the computing power you need.

An allocation is limited in time: it comes with a maximum duration, called the walltime.

The usual workflow for a project is that you do not have a reservation: you freely allocate MIGs, either in interactive mode for a live coding session or as a batch job for long-running computations that do not require you to stay in front of your screen.

Allocation

To use some MIGs for your computations, you need to ask the scheduler (i.e. the slurm server running on the DGX). The main commands to do so are srun (interactive mode, for live coding sessions) and sbatch (for long-running computations).
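
For example, here is a hedged sketch of an interactive allocation; the partition name gpu and the GRES type 1g.10gb below are placeholders, the actual names are given on the slurm partitions page:

    # Request one 1g.10gb MIG for two hours and open an interactive shell on it
    srun --partition=gpu --gres=gpu:1g.10gb:1 --time=02:00:00 --pty bash

Once the shell opens, everything you run inside it sees only the allocated MIG, and the job ends when you exit or when the walltime is reached.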

Available partitions are described in the page dedicated to slurm partitions, and examples (srun and sbatch calls, as well as a basic template for a batch file) can be found in the slurm jobs management page, along with details on multiple slurm directives. A full use case is available on the use case example page.
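
For reference, below is a minimal batch-file sketch in the same spirit as those templates; the partition name, GRES type and script name are placeholders to adapt to your case:

    #!/bin/bash
    #SBATCH --job-name=my_training        # name shown in the queue
    #SBATCH --partition=gpu               # placeholder: use a real partition name
    #SBATCH --gres=gpu:2g.20gb:1          # placeholder: request one 2g.20gb MIG
    #SBATCH --time=12:00:00               # walltime: the job is stopped after 12 hours
    #SBATCH --output=%x_%j.log            # log file named after the job name and id

    # Replace with your own long-running computation
    python3 train.py

Submit the file with sbatch (e.g. sbatch my_job.sbatch) and monitor it with squeue.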